Abstract. An automated, rapid classification of transient events detected in modern synoptic
sky surveys is essential for their scientific utility and for effective follow-up using scarce resources.
This problem will grow by orders of magnitude with the next generation of surveys. We are
exploring a variety of novel automated classification techniques, mostly Bayesian, to respond to
these challenges, using the ongoing CRTS sky survey as a testbed. We briefly describe some of
the methods used.



The increasing number of synoptic surveys is now generating tens to hundreds of
transient events per night, and the rates will keep growing, possibly reaching millions of
transients per night within a decade or so. Generally, follow-up observations are needed
in order to fully exploit these data streams scientifically. In optical surveys, for instance,
all transients look the same when discovered (a starlike object that has changed its
brightness significantly), and yet they could represent vastly different physical phenomena.
Which ones are worthy of follow-up? This is a critical issue for the massive event
streams (e.g., LSST, SKA, etc.), and the sheer volume requires an automated approach
(Donalek et al. 2008; Mahabal et al. 2010; Djorgovski et al. 2011a).

The process of scientific measurement and discovery typically operates on time scales
from days to decades after the original measurements, feeding back to a new theoretical
understanding. However, that clearly would not work for phenomena where a
rapid change occurs on time scales shorter than what it takes to set up a new round of
measurements. This results in the need for real-time systems, consisting of a computational
analysis and decision engine, and optimized follow-up instruments that can be rapidly
deployed with immediate analysis and feedback. These requirements imply a need for
automated classification and decision making.

The classification process for a given transient involves: (1) obtaining the available contextual
archival information and combining it with the measured parameters from the discovery
pipeline, (2) determining the (relative?) probabilities or likelihoods of it belonging to
various classes of transients, (3) obtaining follow-up observations to best disambiguate
competing classes, and (4) using them as feedback and repeating for an improved classification.
We describe below a few techniques that help in this process. Our principal data set
is the transient event stream from the Catalina Real-time Transient Survey (CRTS;
http://crts.caltech.edu; Drake et al. 1999; Djorgovski et al. 2011b; Mahabal et al. 2011),
but the methodology we are developing is more universally applicable.

Bayesian Networks: Generally, the available data for any given event would be heterogeneous
and incomplete. That is difficult to accommodate in the standard machine-learning
feature vector approach, but it can be naturally accommodated in a Bayesian
approach, such as Bayesian Networks (BN) (Mahabal et al. 2008).

We have used three colors obtained from the Palomar 60-inch telescope from follow- 
up observations of CRTS transients, and two contextual parameters: Galactic latitude 
and proximity to a galaxy. Priors for six classes have been used: CVs, SNe, Blazars, 
other AGN, UV Ceti, and the "Rest" (everything else). We are currently adding more 
parameters and classes. About 300 objects each have been used for SN and CV, and 
~100 for blazars. The number statistics for other AGN and UV Ceti are still too small. 
Of the objects classified as SN, 82% are indeed SN (79% for CVs, 69% for blazars). The
contamination is ~10-20%. Given that a single set of observations accomplished
this, the potential for extending the BN and combining its output with other techniques
is very promising.
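
The way a Bayesian approach tolerates incomplete data can be illustrated with a minimal sketch. It assumes the simplest possible network, in which the features are conditionally independent given the class (a naive Bayes structure); the network actually used may well differ. The class list is shortened, and every number, feature name, and the `posterior` function are hypothetical, chosen only for illustration. A feature that was not measured is simply left out of the product.

```python
from math import prod

# Hypothetical priors for three of the classes; NOT values from the survey.
PRIORS = {"SN": 0.4, "CV": 0.4, "Blazar": 0.2}

# P(feature value | class) for two contextual features (illustrative numbers only).
LIKELIHOODS = {
    "near_galaxy": {
        "SN": {True: 0.8, False: 0.2},
        "CV": {True: 0.1, False: 0.9},
        "Blazar": {True: 0.2, False: 0.8},
    },
    "low_galactic_latitude": {
        "SN": {True: 0.2, False: 0.8},
        "CV": {True: 0.6, False: 0.4},
        "Blazar": {True: 0.3, False: 0.7},
    },
}


def posterior(observed):
    """Return P(class | observed features); features set to None are skipped."""
    scores = {}
    for cls, prior in PRIORS.items():
        terms = [
            LIKELIHOODS[name][cls][value]
            for name, value in observed.items()
            if value is not None  # a missing measurement contributes no factor
        ]
        scores[cls] = prior * prod(terms)
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


if __name__ == "__main__":
    # Galaxy proximity is known, Galactic latitude flag is missing.
    print(posterior({"near_galaxy": True, "low_galactic_latitude": None}))
```

With no observed features the function returns the priors unchanged, and each additional measurement sharpens the posterior without requiring the others to be present.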

Lightcurve Based Classification: Structure in sparse and/or irregular light curves
(LC) can be exploited by automated classification algorithms. This can be done by col- 
lecting LCs for different objects belonging to a class and representing and encoding the 
characteristic structure probabilistically in the form of an empirical probability distribu- 
tion function (PDF). This can then be used for subsequent classification of a LC with 
even a few epochs. Moreover, this comparison can be made incrementally over time as 
new observations become available, with the final classification scores improving with 
each additional set of observations. This forms the basis for a real time classification 
methodology. Since the observations come in the form of flux at a given epoch, for each 
point after the very first one we can form a (δm, δt) pair. We focus on modeling the joint
distribution of all such pairs of data points for a given LC. By virtue of being incre- 
ments, the empirical probability density functions of these pairs are invariant to absolute 
magnitude and time shifts, a desirable feature. Upper limits can also be encoded in this 
methodology, e.g., forced photometry magnitudes at a SN location in images taken before 
the star exploded. We currently use smoothed 2D histograms to model the distribution 
of elementary (dm, dt) sets. In our preliminary experimental evaluations with a small 
number of object classes (single outburst like SNe, periodic variable stars like RR Lyrae 
and Miras, as well as stochastic variables like blazars and CVs) we have been able to 
show that the density models for these classes are potentially a powerful method for 
object classification from sparse/irregular LC data. 
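
The increment-based density idea can be sketched as follows. This is a rough illustration under stated assumptions, not the pipeline's implementation: it forms a (dm, dt) pair for every pair of epochs (one reading of "all such pairs"), uses additive smoothing with a constant `alpha` as a stand-in for whatever histogram smoothing is actually applied, and ignores upper limits. The function names, bin edges, and toy light curves are all hypothetical.

```python
from itertools import combinations
from math import log


def dm_dt_pairs(light_curve):
    """light_curve: list of (time, magnitude). Returns (dm, dt) for each pair of epochs."""
    pts = sorted(light_curve)
    return [(m2 - m1, t2 - t1) for (t1, m1), (t2, m2) in combinations(pts, 2) if t2 > t1]


def bin_index(value, edges):
    """Index of the bin containing value; out-of-range values fall in the end bins."""
    for i in range(len(edges) - 1):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


def fit_density(light_curves, dm_edges, dt_edges, alpha=1.0):
    """Additively smoothed, normalized 2D histogram of (dm, dt) for one class."""
    n_dm, n_dt = len(dm_edges) - 1, len(dt_edges) - 1
    counts = [[alpha] * n_dt for _ in range(n_dm)]
    for lc in light_curves:
        for dm, dt in dm_dt_pairs(lc):
            counts[bin_index(dm, dm_edges)][bin_index(dt, dt_edges)] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def log_likelihood(light_curve, density, dm_edges, dt_edges):
    """Sum of log densities over the (dm, dt) pairs of a light curve."""
    return sum(
        log(density[bin_index(dm, dm_edges)][bin_index(dt, dt_edges)])
        for dm, dt in dm_dt_pairs(light_curve)
    )


if __name__ == "__main__":
    dm_edges, dt_edges = [-2, 0, 2], [0, 5, 10]  # hypothetical bins (mag, days)
    brightening = fit_density([[(0, 18.0), (1, 17.0), (2, 16.5)]], dm_edges, dt_edges)
    fading = fit_density([[(0, 16.0), (1, 17.0), (2, 17.5)]], dm_edges, dt_edges)
    new_lc = [(0, 19.0), (2, 18.0)]  # only two epochs
    print(log_likelihood(new_lc, brightening, dm_edges, dt_edges),
          log_likelihood(new_lc, fading, dm_edges, dt_edges))
```

Because only differences enter, shifting a light curve in time or in magnitude leaves its pairs unchanged, and the log-likelihood can be updated incrementally by adding the terms contributed by each new epoch.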

Currently we are using the (dm, dt) distributions for classification in a binary mode,
i.e., successive two-class classifiers in a tree structure: SNe are first separated from
non-SNe (the easiest step, currently performing at ~99% completeness), then non-SNe are
separated into stochastic versus non-stochastic variables, and then each group is further
separated into more branches. The most difficult so far has been the CV-blazar node
(based on just the (dm, dt) density, i.e., without bringing in the proximity to a radio
source, since we are also interested in discovering blazars that were not active when
the archival radio surveys were done). Currently this classifier is performing at ~71%
completeness. We are also exploring Genetic Algorithms to determine the optimal (dm, dt)
bins for different classes. This will in turn help optimise follow-up observing intervals for
specific classes; see, e.g., Mahabal et al. 2011 or Djorgovski et al. 2011a.
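
A tree of successive two-class decisions of this kind might be organized as in the sketch below. It assumes that each branch already has a score for the event (for instance a log-likelihood from a class density model) and simply follows the higher-scoring branch at every node. The `Node` class, the helper functions, the tree layout below the first two splits, and the example scores are hypothetical and serve only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    label: str
    score_left: Optional[Callable] = None   # score of the left branch for an event
    score_right: Optional[Callable] = None  # score of the right branch for an event
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(label):
    return Node(label)


def split(label, left, right):
    # Each branch is scored by looking up its label in a dict of per-class scores.
    return Node(label, lambda s: s[left.label], lambda s: s[right.label], left, right)


def classify(node, scores):
    """Descend the tree, following the higher-scoring branch at each node."""
    path = [node.label]
    while node.left is not None and node.right is not None:
        go_left = node.score_left(scores) >= node.score_right(scores)
        node = node.left if go_left else node.right
        path.append(node.label)
    return path


# SN vs non-SN first, then stochastic vs non-stochastic, then CV vs blazar.
TREE = split(
    "transient",
    leaf("SN"),
    split("non-SN", split("stochastic", leaf("CV"), leaf("blazar")), leaf("non-stochastic")),
)

if __name__ == "__main__":
    # Hypothetical log-likelihood scores for one event.
    scores = {"SN": -9.0, "non-SN": -2.0, "stochastic": -1.0,
              "non-stochastic": -4.0, "CV": -3.0, "blazar": -2.5}
    print(classify(TREE, scores))
```

Returning the whole path rather than only the final leaf keeps the intermediate decisions visible, which is useful when a node such as CV versus blazar is known to be less reliable than the ones above it.
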

Incorporating the Contextual Information: Contextual information can be highly
relevant to resolving competing interpretations: for example, the light curve and observed
properties of a transient might be consistent with it being a cataclysmic variable star,
a blazar, or a SN. If it is subsequently known that there is a galaxy in close proximity,
the SN interpretation becomes much more plausible. Such information, however, can be
highly uncertain or missing altogether, and it can have a rich structure: if there were
two galaxies nearby instead of one, then details of galaxy type and structure and native
stellar populations become important, e.g., is this type of SN more consistent with being
in the extended halo of a large spiral galaxy or in close proximity to a faint dwarf galaxy?
The ability to incorporate such contextual information in a quantifiable fashion is highly 
desirable. We have been compiling priors for such information as well. These then get 
incorporated into the Bayesian network mentioned earlier. 

We are also investigating the use of crowdsourcing (citizen science) as a means of har- 
vesting the human pattern recognition skills, especially in the context of capturing the 
relevant contextual information, and turning them into machine-processable algorithms. 
A methodology employing contextual knowledge forms a natural extension to the logistic 
regression and classification methods mentioned above. This will be necessary for larger 
future surveys where the data flow will exceed the available human resources, and more- 
over, it would make such classification objective and repeatable. It also represents an 
example of a human-machine collaborative discovery process. 

Transients can also be found using the technique of image subtraction using a matched
older observation, or a deeper co-added image (Drake et al. 1999). If the images are
properly matched, transients stand out as a positive residual. When used with white
light, as is the case with CRTS, the difference images tend to have bipolar residuals,
thus leading to false detections. We have been experimenting with these to look for
supernovae in galaxies using citizen science, where a few amateur astronomers regularly
look at the galaxy images along with the residuals presented to them. A large number
of SNe have been found in this fashion (see Prieto et al. 2011 for an example, and



A given classifier may not be optimal for all classes, nor for all types of inputs. That is
the primary reason why multiple types of classifiers have to be employed in the complex 
task of classifying transients in real time. Presence of different bits of information can 
trigger different classifiers. In some cases more than one classifier can be used for the 
same kinds of inputs. An essential task, then, is to derive an optimal event classification, 
given inputs from a diverse set of classifiers such as those described above. Combining 
different classifiers with different numbers of output classes, and in the presence of error bars,
is a non-trivial task and is still under development.